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Abstract 

Based on a modification of the "Dichotomy Algorithm" (Terekhov, 2010), we propose a parallel procedure 
for solving tridiagonal systems of equations with Toeplitz matrices. Taking the structure of the Toeplitz 
matrices, we may substantially reduce the number of the "preliminary calculations" of the Dichotomy 
Algorithm, which makes it possible to effectively solve a series as well as a single system of equations. On the 
example of solving of elliptic equations by the Separation Variable Method, we show that the computation 
accuracy is comparable with the sequential version of the Thomas method, and the dependence of the 
speedup on the number of processors is almost linear. The proposed modification is aimed at parallel 
realization of a broad class of numerical methods including the inversion of Toeplitz and quasi- Toeplitz 
tridiagonal matrices. 

Keywords: Parallel Dichotomy algorithm, Tridiagonal matrix algorithm (TDM A), Thomas algorithm, 
Poisson equation, Toeplitz matrixes, Fourier Method 
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1. Introduction 



The realization of many numerical algorithms, such as the Multigrid Line Relaxation [l|, 0, ADI and 
Separation Variable Method 0, 0, d, @ , Cyclic Reduction Method [J, [H[ for solving elliptic equations and 
also the problems of construction of splines 0, H| etc. requires solving tridiagonal systems of equations with 
Toeplitz matrices 0, [l(| 
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Since, in solving modern problems of mathematical modeling, the size and the number of such problems 
can reach several tens of thousands, such computations must be performed on a supercomputer. 



A lot of versions of parallel Thomas algorithms for general tridiagonal matrices [111, [12J, |13|, [1J, ll5|, [lg, 
17 . as well as for Toeplitz matrices 13, 2(| 21 1, have been worked out as of t oday . Algorithms have 



been proposed that take into account the presence of diagonal dominance 22] . In [23j ,on the example of 
the realization of the ADI method, the approach was considered in which combination of computations and 
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interprocessor exchanges makes it possible to increase the paralleling efficiency. In [24| , a method called the 
Dichotomy Algorithm was proposed for solving a series of problems with a constant matrix and different 
right-hand sides 

AX m = F m , m = l,...,M, 

(2) 

A = tridiag{..., Cj, bi, a,, ...}, i = 1, ...N, c\ — ajy = 0, 

where M is the number of problems in the series. The advantage of the Dichotomy Algorithm consists in that 
it makes it possible to attain a thousandfold speedup for problem ^ not only in theory but also in practice. 
Aiming at a further development of this approach, we, based on a modification of the Dichotomy Algorithm, 
will propose a parallel algorithm for solving not only a series but a single system of linear algebraic equations 
(SLAE) with Toeplitz matrices. 

A sufficient condition for the applicability of the Dichotomy Algorithm is the diagonal dominance of 
the matrix of the SLAE. As regards accuracy, the number of arithmetic operations, and the number of 
communication interactions, the Dichotomy Algorithm is almost equivalent to the Cyclic Reduction method 
[25L [26l| However, for comparable data volumes, the real time of interprocessor interactions in the Dichotomy 
Algorithm is much smaller. This is explained by the fact that all interprocessor exchanges may be carried 
out via a sequence of calls to the collective operation " All-to-One-Reduce(+)" (27l|.The account of such 
properties as the associativity and commutativity of the operation " +" taken over the distributed data makes 
it possible to reduce the time of communication interactions by optimizing them[28l. I29I [3oL l3ll l32j .While the 
organization of exchanges via a multiple call to the nonblocking]] function " All-to-One-Reduce(+)" makes 
it possible to reduce the time of synchronization of processor elements (PE). Indeed, if, in one group of 
processors H, there exist two free PEs with prepared data then the execution of the collective operation 
"(+)" over this group of processors can start to even if the previous call on all processors is not finished. 
Thus, the structure of the Dichotomy Algorithm yields ample opportunities for the minimization of the data 
transfer time as well as the PE synchronization time. In the cyclic reduction method, the fixed order of 
elimination of the unknowns, on the one hand, restricts optimization of communication interactions, and on 
the other, requires synchronization of the computations on each reduction order. 

The presence of a preliminary step with O(N) arithmetic operations spent, where N is the dimension of 
the SLAE, does not make it possible to use the Dichotomy Algorithm effectively for solving one problem. 
However, for Toeplitz matrices, an economical preliminary procedure can be constructed with the number 
of the arithmetic operations of order 0(N/p + log 2 p), where p where is the number of the processors. Thus, 
a modification of the preliminary step in the Dichotomy Algorithm enables us to effectively solve not only 
a series but a single SLAE with a Toeplitz matrix. 

The structure of the article is as follows. In Section 2, we expose the Dichotomy Algorithm for the 
general case. In Section 3, we propose an economical preliminary procedure for the inversion of tridiagonal 
matrices in the Toeplitz class. We consider the problems of the stability of the dichotomy process and the 
accuracy of the so-obtained solution for systems with or without diagonal dominance. In Section 4, we give 
the results of numerical experiments. Section 5 is devoted to the summary of the work we have done. 

2. The Dichotomy Algorithm 

Before starting the exposition of the Dichotomy Algorithm, consider the question of mapping the data 
of the problem onto many processors. 

2.1. Data decomposition 

Let p be the number of the PEs. Partition the vector of the right-hand side and the solution vector F 
and X into subvectors Qi, Ui as follows: 



1 This requirement is necessary for the realization of the Dichotomy Algorithm (see Subsection 12.511 
2 A communicator in the terminology of the MPI. 
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F = (Qi, Q 2 , Q p ) = (/i, /2, /aiie{P}-l}/ai»e{P}) ) 



(3) 



T / \ T 

X = (Ui,U 2 , ...,U P ) = (xi,a;2, a; s fae{x}-i, ^ s fae{X}j • 



(4) 



Denoting by size{V} the number of the components of a vector V, require the fulfillment of the following 
conditions: 



size{Qi} = size{XJi} > 2, 



1, 



,p, 



J2i=i size{Qi} = J2i=i size{\5i] = size{X}. 

Assume that the pair of subvectors (Qi,Ui) belongs to the PE with the number i and the row of the 
matrix A having number j belongs to the PE containing the pair of the elements (xj, fj) of <|3j) , (|4j) . 
Introduce in addition the following notations: 

• Denote the first and the last elements of a vector V by first{V} and last{V}. 

• Denote by {^4}[ the matrix obtained from a matrix A by throwing off all rows and columns with the 
numbers less than I or greater than t. 

• Denote by {V}J the subvector obtained from a vector V by throwing off the components with the 
numbers less than I or greater than t. 

• Define as a coordinate vector in 

2.2. The Dichotomy of a SLAE 

The Dichotomy Algorithm is a representative of the class of algorithms known as "Divide & Conquer" 
17 , 33|, Hil . On each dichotomy level, the tridiagonal system of equations obtained at the previous step is 
partitioned into three independent subsystems of lesser dimensions (Fig. [5]) by computing the solutions in 
the /irsi{U m }, last{TJ m } components (Fig. Q]). 
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Figure 1: The first and last elements of the solution vector. 

Thus, on the first dichotomy level, we construct two components of the solution vector 

(X) mi = first{U m }, (X) mfl = last{U m }, 



(5) 



situated on the 771th processor. The calculation of these components makes it possible to replace the initial 
system by three independent problems 



{AX}™ 1 - 1 



{F}^- 1 -a mi _ 1 (X) mt e 1 



( X mL + i \ / frn L + l - C;n L +l ( X ) mi \ 

Xm L +2 _ fm L + 2 
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(6) 

(7) 
(8) 




e R = (1,0,0, ...,0) T , e L = (0,...,0,0, 1) T . 

On the second dichotomy level, we similarly separate systems ([6]) and (jSJ. As a result, in [~log 2 p] steps, 
the initial system of equations ^ will split into p independent subsystems of the form ([?])■ The solution 
to (O on each processor may be found independently with the use of any sequential version of the Thomas 
algorithm 0, [s^j for 0{size{K.} / p) arithmetic operations. 

2.3. The Main Formulas 

The process of the calculation of the first{XJ m } , last{\J m } components consists in two steps: a prelim- 
inary step carried out once for all right-hand sides of ^ and the dichotomy process at which the solution 
is computed for each right-hand side. 

At the preliminary step, on the mth processor locally without communication interactions, we compute 
two rows of the matrix A^ 1 

A G m e rn L > G m — e m R > (9) 

where rriL , tur are defined in ((S|) and is an ort. 

The vectors G R,L have the sense of the difference Green's function [36| for the corresponding three-point 
boundary value problem. 

Remark 1. In [24j, the system of linear algebraic equation with symmetric and asymmetric tridiagonal 
matrices were separately considered (Lemma 1 and Section 3.6). However, with allowance for the property 
(y4 T ) _1 = (A~ 1 ) T , the preliminary procedure can be essentially simplified reducing to the calculation of 
the vectors G R,L following This makes possible, on the one hand, not to consider the case with an 
asymmetric matrix and, on the other hand, to exclude the situation like overflow, which may take place in 
explicit calculation of the inverse matrix entries. 

In addition, at the preliminary step, we compute two vectors 

(10) 

7R _ (-i r R ~R ^RA T 

— y 1 ' z m R +\i • ■ • 7 z Af-li Z N) ' 

where the components of the vectors are found as solutions to the systems 
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Figure 3: Auxiliary vectors for the Dichotomy Algorithm and the Partition Algorithm. 

Thus, the costs of the preliminary computations for the Dichotomy Algorithm will constitute size{Z^ h } = 
O(N). Therefore, it is appropriate to apply the Dichotomy Algorithm for solving several SLAE with constant 
matrix and different right-hand sides. In this case, the preliminary computations may be neglected. 

Like in the Partition Algorithm [ljllli!], the Dichotomy Algorithm is based on the superposition principle [37] 
That is, the components of the solution vector are expressed via the sum of the general solution to the ho- 
mogeneous equation and a partial solution to the nonhomogeneous equation. However, in the Dichotomy 
Algorithm, this principle is realized in a somewhat different manner. 



Suppose that Vi ^ m, ||Qi|| = 0. Then the components of (O satisfy the identity [24 1 



( x w = E ( F h ( G -)i . ( x u = E ( p )i (g* 



(13) 
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The remaining components of the solution vector may be determined from the auxiliary vectors Z^' L as 
follows: 



(X) 7 



i < rriL 



(Z*) 4 (X) mR , i>m R . 



(14) 



If we consider all possible cases m — l,...,p when ||Q m || 7^ while Vi 7^ m \\Qi\\ = and then sum up 
the so-obtained solutions (fT4")) . we come to the formula for computing the first, last- elements 



where 



( m—l p 
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(X) fc = first{V m }, 



(X) fc = last{U m }, 



(15) 



Pt= E ( F ),( G ™),> Pt= E ( F ),( G m)i- 



(16) 



Here the indices mn,m,L are found locally on each processor by (|S|). 

Fig. [3] contains the auxiliary vectors for the Partition Algorithm and the Dichotomy Algorithm. The vec- 
tors u m , v m are the solution to the interior difference boundary value problem with respect to the subvector 
U m , whereas the vectors Z^' L are the solution to the exterior boundary value problem. This difference is 
of principle. So, in the realization of the Dichotomy Algorithm, the computation of the first, last - com- 
ponents is reduced to the calculation of the sums (|15p . whereas, in the Partition Algorithm, it is necessary 
to solve a "reduced" 17, [33| SLAE of dimension 2p — 2 by means of some parallel variant of the Gauss 



Elimination Method. Since the computation of the sums on a multiprocessor computing system can be 
realized with a greater efficiency then the Gauss elimination method, the Dichotomy Algorithm makes it 
possible to reach a greater productivity for the problems of the form than the Partition Algorithm. 

2.4- MPI realization of the Dichotomy Algorithm 

Fig. @] contains the MPI-realization of (|T5|) for partitioning the system AX. = F, size{F} = N. 
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Figure 4: The MPI-realization of the Dichotomy Process. 

Assume that the vectors G^ L , Z^ L are already defined at the preliminary step. Then, at step 1 of the 
Dichotomy Process, by (|16[) each processor computes two local quantities /3^' L ', where m is the processor 
number. The arithmetic costs at this step are 0(size{\Ji}) for each PE. In the sequel, in the separation of 
the systems, the quantities /3^ ,L will be recomputed for a number of operations of order 0(1). 

At the first dichotomy level, consisting of steps 2.1-2.2-2.3, we compute two components of (JSJ) with the 
use of (|T5]) as follows. 

At step 2.1, by calling the collective function mpi_Reduce over the communicators Comuii, Comma, wc 
compute the quantities 

m— 1 p 
j=l j=m+l 



G 



At step 2.2, processor m — 1 serial to processor m the quantity £ R and processor m + 1, the quantity 



Now, the sought components situated on the mth processor may be computed as 24 1 



(X) mL = first{V m } = e + £ L S^ + fc, 

(18) 

(X) mfl = last{U m } = e R 7^^ + C h + C- 

Remark 2. The quantities (G^) I (GbS) and (G^ ) /(GjJ,) make it possible to "transfer" the 
boundary condition from the component /irsi to the component Zasi and vice versa, i.e., inside the vector 
U m . The components of the vectors Z R,L cannot be used in this version of the Dichotomy Algorithm since 
they are a solution to the interior boundary value problem with respect to the subvector U m (Fig. [3]). 

Furthermore, to exclude the thus-found components from the system of equations, processor m send^f] 
to processor m — 1 the quantity S L = ( £ L (qR)"*" + /?m j (^m) mr _i an( l to processor m+ 1, the quantity 

^ = (^Snff +/? ™) ( Z ™) m «+i actively 

At step 2.3, the vector of the right-hand side of SLAE © , © j © is modified, and the processors with 
numbers m — 1, m + 1 recompute the quantities 

5L _ aL L ( G m-i) tjj oR _ oR eR ( G " + 1 ) qL ( 19 ) 

Pm-1 — Pm-1 + / qR N i Pm+1 — Pm+1 ' f G L \ ' 

where (X) tH = last{U m -i}, (X) gi = /irsi{U m+ i}, 

On the next dichotomy level, an analogous separation process is applied to subsystems ©,([8]), where 
the quantities ([T^ll are used instead of (3^-1 m+i- Moreover, the steps having numbers s.l, s.2, s.3, (where 
,s = 1, 2, |~log 2 p] ) are equivalent to steps 2.1, 2.2, 2.3. However, on the sth level, 2 S_1 already independent 
systems obtained at the previous step are separated independently 

2. 5. Optimization of the Communication Costs 

Since the number of arithmetic operations at the step of solving the " reduced" SLAE is relatively small, 
the communication cost of the algorithm of solving this subproblem defines the efficiency of the Partition 
Algorithm. The "reduced" system may be solved by the Cyclic Reduction Method [l3|, [38|. In [24| it was 
shown that the estimates of the computation time for the Dichotomy Algorithm and the Cyclic Reduction 
coincide in order if, first, the delay time a before the data transfer is insignificant and, second, not one but 
several series of problems are solved simultaneously 

T mchotomy = a [bg2(p) + !] loga(p) + l ( log2(p) _P_iy i + 20) ~ 

«21og 2 (p) (ij8 + fry/2), 

T Cy die Reduction = 2 \ og ^p) ( a + ^ + ty) ~ 2 \og 2 (p) (1/3 + Z7) . 



3 This exchange is designated as "First" in Fig. [2] 
4 This exchange is designated as "After" in Fig. [2] 
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Here /3 is the time of transfer of a real number from one PE to another, 7 is the time spent on the 
operation of addition of two numbers, I is the number of the series of the simultaneously solved SLAEs. 

In comparison with the Partition Algorithm + the Cyclic Reduction, the Dichotomy Algorithm gives a 
higher speedup coefficient due to lesser time costs for the synchronization of the computations as well as on 
the data exchange between the PEs. 

The reduction of the communication time is possible due to the fact that the main communication 
operation ("+") of the Dichotomy Algorithm is associative. The architecture of modern supercomputers is 
such that the time of pairwise interactions may differ substantially for different processors EH, EH E|. The 



associativity of the computations makes it possible to define the order of the interactions of the PEs on the 
level of a communication library or a programming language so as to minimize the time of data exchanges 
by taking into account the architecture of the supercomputer. 

The reduction of the interprocessor exchange time is not a sufficient condition for the efficiency of 
the parallel algorithm since it is necessary to take into account the presence of synchronization of the 
computations. It often happens that most of the PEs expect the computation results from several PEs, 
which leads to the decrease of the total efficiency due to down-time periods of the computational resources. 
In the context of the Dichotomy Algorithm, the problem of minimization of the synchronization time is 
solved as follows. First, the number of groups of interacting processors increases with the number of 
the dichotomy level but the number of processes in each of them decreases; hence, so does the time of 
interprocessor exchanges. Second, the Dichotomy Algorithm almost does not require to synchronize the 
computations in passing from one dichotomy level to another (especially, on the first levels, where the 
number of communications is maximal). Thus, for beginning the process of partitioning the system of 
equations on dichotomy level s + 1, it is first necessary to modify the quantities (|19[) at step s.3. However, 
the processors that do not compute the f3^' L at step s.3 may start the summation of the series ([T7|) at step 
(s + l).l without waiting for the result of the previous steps. Thus, on a majority of the processors, steps 
s.2, s.3, and (s + l).l may be carried out with a high degree of parallelism. 

3. The Parallel Dichotomy Algorithm for Toeplitz Matrices 

The realization of the Dichotomy Algorithm on a parallel computational system may be regarded as the 
solution of two separate subproblems: 

1. Minimization of the number of arithmetic operations in the computation of the auxiliary vectors 

7R,L pR,L 

2. Minimization of the time of the communications and synchronization in the realization of (I15[) . 

Since, at the second step of the Dichotomy Algorithm, the form of the SLAE does not matter, we, in 
order to guarantee the possibility of solving not only a series but a single equation with matrix of the form 
(HJ, pose the problem of reducing the number of arithmetic operations at the preliminary step. Note that, in 
view of (fT5)) . (jl6D we do not need to determine all components of the auxiliary vectors but only those used. 
Thus, we need to find only 0(N/p) components of G^' L and |~log 2 p] components of Z^ ,L . 

3.1. Optimization of the Preliminary Computations 

For a general matrix ©, finding the necessary components z i ' ,p i ' from (l^)). (|10p requires O(N) arith- 
metic operations. For Toeplitz matrices, the components of the auxiliary vectors may be found in a lesser 
number of operations in accordance with the following theorem. 

Theorem 1. @, Q3> 0/ Assume that we need to solve a SLAE 

TY = F 

with matrix (QP of order N . Then the nth component of the solution may be calculated as 

(n n V l - fc (n N + 1 - n n N + 1 -' n \ ( n k n k \ f N (n N + 1 - k „N+l-k\ frt» n n \ f 

In n \ (n N + 1 n N + !\ ' U ^ (n n \ fn N + 1 n N + 1 \ ' U ' ^ ' 
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Moreover, the solution does not exist if q^ +1 — q^ +1 but q\ =/= Q2- 

Involving the fact that the right-hand sides in (l^)). (jlip . (ll2D may contain a unique nonzero component, 
it is easy to prove that, for computing one component of the vectors Z^ ,L , G^ ,L , by (j20|) we will need O(l) 
arithmetic operations. Therefore, at the preliminary step as well as at the step of the partition of the SLAE, 
it will be necessary to carry out 0(N/p + log 2 p) arithmetic operations. Thus, the Dichotomy Algorithm for 
Toeplitz matrices may be applied not only to a series but to a single problem. 

A similar economic preliminary procedure may be constructed for matrices of a somewhat more general 
kind than ([1]). For example, in |lOl |42| . explicit expressions are given for computing the elements of the 
inverse matrix to a matrix of the form 



t-i 



to 



V 







to 
t-i 



h 

to + X J 



(21) 



Thus, if, for some tridiagonal matrix, it is possible to compute a separate component of the vectors 
Zj^ L , G^ ,L for O(l) arithmetic operations then, in this case, it is possible to effectively solve a series as well 
as a single problem. 

3.2. Inversion of the Operator V 2 

In solving the first boundary value problem for the Poisson equation by such methods as the Multigrid 
Line Relaxation, the Variable-Separation Method etc., it is necessary to solve a series of equations of the 
form 



h 2 



+ X kVl = -fi, X k e R 1<*<JV-1, fc = l,...,M, 



(22) 



Consider an economic algorithm for computing of the components of the vectors Z^' L ,G^' L for problems 
(|22|) . It is known [l(J Ell that the solution to (|22|) is given by ([20| and may be written down in the form 



U N -n-\{x) 
U N -i(x) 



n-l 

/Ji + ^£4-i(z)/(fc) 

k=l 



Un-l(x) 



N-l 



k—n 



(23) 



where x = 1 — h 2 X/2 ^ cos-^, k — 1,2,..., A — 1 and U n (x) is the Chebyshev polynomial of the sec- 
ond kind of n degree [4J, |45 1 . 

With ((5]) taken into account, the solution to (fTU)) on the mth processor has the form 



R = UN-i-l(x) < . < N 



yL _ Ui-i(x) 



1 < i < UlL- 



The solution to © has the form 



(24) 



9i = t^il ^t-'W' m L <i<m R ,, 



9f = UM-i(x) U N-m R -i{x), m L <i< m R . 



(25) 



Since the necessary components of G^ ,L are situated successively (fTo]) , having computed gm^,m L by (ESJ) , 



it is more economic to compute the remaining components by solving the systems 



i mo-1 



(26) 
(27) 



3.3. Accuracy analysis 

Since the values of Uk{x) outside the interval [—1,1] begin to increase rapidl y w ith k, we cannot exclude 
an overflow- type situation in a program realization of (12"4"1) . (|2"5j) for |x| > 1 46, [47| , To overcome this 
difficulty, we do the following. Let Nq be the degree of the polynomial greater than which the quantity 
Uk{x), k > N , \x\ > 1 cannot be computed because the result exceeds the boundary of the admissible 
values of the real variable. A Chebyshev polynomial of the second kind admits the representation [Zij 



U n {x) 



Then the relation 



1 



2Vx T ^l 



(x+y/. 



x 2 - 1 



n+l 



(x + Va 



-(n+l) 



X > 1. 



(x + \/x 2 - l) ( + ) < (x + \/x 2 - l) 1 
holds Vn > ATo, from which Vn > iVo we may assume with a good accuracy that 

^)-^=t(^^-t)" +1 . 

Now, substituting (30) into (25) and collecting similar summands, we infer 
l 1 I" ( 



(n+l) 



(28) 
(29) 

(30) 



+ 1 . ^(mt-i) _|_ ^(-2JV-i+mx,) _ ^(-2iV+mz,+i) _ ^(-mt-i)! 



0, 



2\/x 2 -l 



(i-ms) _|_ r j(-2N+i-m R ) _ ^(-2JV+TO R +t) _ ^(— i-mje)! 



m L < i, 



(31) 



where 77 = x + v?-T. 

Since the exponents in (|3ip are at most zero and |x| > 1, the situation of an overflow of variables is 
excluded. 

Remark 3. In solving tridiagonal systems of dimension less than Nq, the computations of the components 
of the auxiliary vectors should be performed by (|2~41 . (|2~5j) . (|2"5)) so as to avoid loss of accuracy because of a 
possible violation of ([2"9"| . 

Remark 4. If \x\ < 1 then Uk{x) < k + 1. Hence no overflow of variables arises for x ^ cos-^, k = 
1, 2, iV — 1 and reasonable N. 

3.4- Stability analysis 

It is proved in 24] that a sufficient condition of stability for the Dichotomy Algorithm is given by the 
diagonal dominance of the matrix of the SLAE 



N > K 



i = 2,...,N-l, 



(32) 



|6i|>|oi|, \b N \>\c N \, 
and at least one of the inequalities is strict. 



(33) 
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For problem (j2"2")l in the case of A < 0, the matrix of the SLAE has diagonal dominance, which guar- 
antees the stability of the Dichotomy Algorithm. However, for A > 0, there is no diagonal dominance and 
accumulation of roundoff error can happen. This follows from the fact that, by the failure of the maximum 
principle [3(1 the components of the vectors Z^ ,L may have their modulus greater than one (Fig. [5]). 
In this case, the error of the computation of the first, Zast-components committed on the sth dichotomy 
level passes to s+ 1 level by means of (|19p and is then "strengthened" by multiplying by |Z^ L | < 7 in (fT5|) . 
Thus, if the number of the dichotomy levels is great and 7 is much greater than 1 then an accuracy loss is 
possible. 



30-, 




6000 6500 7000 7500 8000 6000 6500 7000 7500 8000 



(Z R )i (Z R )i 

a) A fc > A fc+ i > b) A fc+ i < X k < 

Figure 5: The values of the components of Zi^ for the case N = 8192, p = 4 in solving (122 D for different values A. 



If we take the estimate 

I 

Si = l[\{z^)\s<j l e, (34) 
i=i 

as the maximal error of the computation of the component (X) is where I is the number of the dichotomy 
step at which the sought component will be computed, e is the initial error. Then, for example, for p = 
4096, 1 — 12 and 7 < 3, the computation error at worst goes six digits to the right, which is acceptable for 
many problems. Thus, for 7 > 1, we need to perform an a priori (|34[) as well as an a posteriori check of the 
accuracy of the so-obtained solution. 



4. Numerical Experiments 

For estimating the efficiency of the above-proposed modification of the Dichotomy Algorithm for Toeplitz 
matrices, consider the following boundary value problem 

Au + \u = f(x), x G G = {0< x a < 1, a= 1,2}, A e R, 

(35) 

tt| r = 0. 

For solving the difference problem in Fortran — 90 with the use of the communication MPI-library, we 
have realized the variable separation method following the scheme of [24]]. The approximation of f|35[) 
was carried out with the second order of accuracy on a uniform mesh with the number of the points 
Ni = N2 = 2 , k = 13. ..16. For estimating the cots of the preliminary computations, the calculation 
of the vectors G^' L , Z^' L for problem (|35l) was realized in two variants: by for an ([9^. (ITT|) . (|12I) arbitrary 
symmetric matrix and by (|24H27[) . (|31D for Toeplitz matrices. Since the matrix of the SLAE is not used 
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at the second step of the Dichotomy Algorithm, one MPI- realization of ([T5]) . (fTB]) is enough for these cases. 
Test calculations were performed on an MBC-lOOk supercomputer of the Interdepartmental Center of the 
Russian Academy of Sciences; the supercomputer is based on Intel Xeon four-core processors operating at 
3 GHz in the Infiniband communication environment. 

The results of the numerical experiments with the number of the processors ranging from 32 to 4096 
for A € [-1....2 fc ] showed that the components of the vectors Z^ L , G^ L calculated by (I24j|27j) . (13"TT) and 
(Pl). ([TT]) . p^|) coincide with the computer accuracy. For positive as well as negative values of A, the accuracy 
of the Dichotomy Algorithm for problem ([33)1 is comparable with the accuracy like in the use of the sequential 
version of the Thomas method Q 

The dependencies of the time of the preliminary computations, the time of the dichotomy process 
( T StepT al > T s°cpi' tZ ' T stc P2 ), and the speedup (S S to P2 , s stepi+stc P2 ) on tnc number of the processors are 
given in Tables TTl2l and in Fig.[5J We can see from Tables 1X121 that the time necessary for the computation of 
the vectors G^ ,L , Z^ ,L for general matrices is independent of the number of processors and is proportional 
to the number of unknowns. This is due to the fact that the total dimension of the auxiliary vectors (Fig. [3]) 
has order 0(size{'K}). Thus, if we neglect the fact that the matrices are Toeplitz for (|35)) then the time 
costs for the preliminary computations will be 0(NiN2). For Toeplitz matrices, the use of (|24M27I) . (|3"TT) . in 
comparison with the general case ([9")). (|11[) . (IT21 makes it possible to decrease the computation time of the 
auxiliary vectors (Tgj.™ ™ 1 , Tg"^ 1 * 2 ) by several orders. As a result, it becomes possible to effectively solve 
not only a series but a single system of equations. Since, in solving a single SLAE, we cannot neglect the 
preliminary computations, taking these costs into account, the speedup coefficient is 1.5 — 2.5 as little as 
that obtained in solving a series of problems (Sg t ° C p G 2 ral , S<^?^g t ) 



JVi x iV 2 


8192x8192 


16384x16384 


rp (general 
Stcpi 


« 3.3 sec. 


« 13 sec. 


NP 


rp Toeplitz 
Stcpi 


Tstcp 2 


Sstcp 2 


oTocplitz 

°Stepi +Stcp9 


rp Toeplitz 
1 Step, 


Tstcp 2 


Sstop 2 


g Toeplitz 

Stcpi +Stcp9 


32 


0.2 


3.9e-01 


28 


19 


0.75 


1.6 


33 


23 


64 


0.1 


1.9e-01 


58 


38 


0.33 


0.84 


63 


45 


128 


5.6e-02 


1.0e-01 


110 


71 


0.17 


0.4 


126 


93 


256 


3.7e-02 


4.9e-02 


224 


127 


0.12 


0.2 


265 


166 


512 


3.0e-02 


2.6e-02 


423 


196 


8.0e-02 


0.12 


664 


265 


1024 


2.6e-02 


1.6e-02 


687 


250 


6.5e-02 


6.5e-02 


817 


408 


2048 


2.6e-02 


1.4e-02 


785 


275 


5.8e-02 


4.2e-02 


1264 


531 


4096 


2.5e-02 


1.6e-02 


687 


268 


5.5e-02 


6.0e-02 


885 


461 



Table 1: The dependence of the computation time T and the speedup S on the number of processors. Here Tg°*p lltz , T^™"' 211 
are the times of the preliminary computations for matrices JTJ and general matrices respectively; Tg tep2 is the time of the 
implementation of the dichotomy process for any tridiagonal matrix; Sstep 2 i g the speedup for one right-hand side without the 
preliminary computations; Sg°°p '|£g t is the speedup for one right-hand side for Toeplitz matrices. 

In Fig. we give the dependence of the speedup coefficient on the number of processors. We see that 
the Dichotomy Algorithm guarantees an speedup close to that linear. For a large number of processors, no 
substantial fall of efficiency arises because of the predominance of interprocessor exchanges. This is ensured 
by dynamic optimization of the communication interactions. Indeed, calling the function MPI -Reduce for 
the first time, we can collect information about the time of the interaction between the processors, the 
volume of the data transferred etc. for each communicator. This will make it possible to optimize the 
processes of data exchange for subsequent calls of M P 'I -Reduce [28l 0, 0, |3l|. The dependence of the 
computation time on the number of processors in the case of the use of dynamic optimization and without 
it is given in Fig. [7J For a small number of processors p < 512, dynamic optimization does not influence the 
computation time much since, in this case, the computational costs exceed those communicational. However, 



5 The solution of systems (0 obtained in the dichotomy process for A > was carried out by the "non-monotonic elimination" 
method [f|. 
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Stcpi 
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NP 


rp Toeplitz 
Stcpi 
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rpXoeplitz 
Stcpi 
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Sstcp 2 


o'-L'ocplitz 
Stcpi +Stcp9 


128 


0.81 
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256 


0.41 


0.89 


254 


174 


1.62 


3.83 






512 


0.24 


0.58 


390 


276 


0.84 


1.92 


510 


355 


1024 


0.17 


0.28 


809 


503 


0.51 


1.0 


980 


653 


2048 


0.13 


0.15 


1510 


809 


0.36 


0.59 


1661 


1032 


4096 


0.12 


0.10 


2265 


1029 


0.30 


0.35 


2801 


1508 



Table 2: 



as the number of the processors increases, the effect caused by optimization becomes more than substantial 
(Fig.©. 

Thus, the high efficiency of the Dichotomy Algorithm is provided, on the one hand, by the low costs 
of the synchronization of the computations, and on the other, by the possibility of reduction for the data 
transfer time by static and dynamic optimization of the communication interactions. 




5. Conclusions 

In this article, we have proposed a parallel algorithm for solving tridiagonal systems of equations with 
Toeplitz matrices demonstrating high efficiency including for thousands of processors. The method is based 
on the Dichotomy Algorithm devised for solving a series of tridiagonal systems of linear equations with 
constant matrix and different right-hand sides. The fact that, for Toeplitz matrices, each component of 
the solution vector may be calculated without solving the SLAE makes it possible to substantially reduce 
the computation time at the preliminary step of the Dichotomy Algorithm. As a result, it became possible 
to effectively solve not only a series but a single system of equations. The reduction of the time for the 
preliminary computations is also very important in solving a series of problems with a large number of 
unknowns, since, if we disregard the special structure of the matrix, the delay before the direct start of 
solving the equation may be substantial. 

In some algorithms considered above, the presence of diagonal dominance for the matrix of the SLAE 
is taken into account, which enables us to reduce the number of interprocessor exchanges. Sometimes 
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Figure 7: The influence of the dynamic optimization of 
interprocessor interactions on the computation time. 



Figure 8: The ratio of the time of solving the problem 
for the first right-hand side to the computation time for 
subsequent ones. 



this imposes the constraint that the size if systems of the form of (JJJ) must be at least some threshold 
value; otherwise, we have an accuracy loss [22]. Thus, the maximal number of the processors that may 
be used for solving the problem depends on the presence of diagonal dominance. If we use the Dichotomy 
Algorithm, such dependencies do not arise, and the solution satisfies (with the computer accuracy in mind) 
the initial system of equations for any number of the processors. As a result, the process of paralleling of the 
already existing numerical methods including the inversion procedure of a tridiagonal SLAE is substantially 
simplified. 

Computational experiments confirm that dynamic optimization on the level of the MPI-library makes it 
possible to substantially reduce the time of interprocessor exchanges. The effect caused by optimization is 
especially noticeable in computations with the use of a large number of processors. Thus, the above-proposed 
algorithm for solving tridiagonal systems of equations with Toeplitz matrices makes it possible to attain a 
high computation speed in a wide range of processors and guarantee high transferability of the software. 
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